Spam detection remains a critical challenge in digital communication, as adversarial spam evolves continuously to evade rule-based and conventional statistical filters. This paper presents SpamShield, a production-grade, multilingual spam detection system that fine-tunes the XLM-RoBERTa transformer on a curated multilingual dataset covering English, Hindi, and 11 additional languages. The system is organized as a three-phase pipeline: data consolidation and language-aware preprocessing (Phase 1), transformer-based model training and evaluation (Phase 2), and real-time REST API deployment using FastAPI (Phase 3). The proposed approach achieves 98.16% classification accuracy, a macro F1-score of 0.9563, and an AUC-ROC of 0.9911 on a held-out test set of 3,266 samples. A responsive single-page web application enables real-time inference with automatic language detection. The architecture is modular, scalable, and readily extensible to additional languages and communication modalities, making SpamShield a practical foundation for enterprise-grade spam filtering.
Introduction
SpamShield is a multilingual spam detection framework designed to address the limitations of traditional spam filters against modern phishing and spam attacks. Conventional approaches such as blacklisting, keyword matching, and Bayesian classifiers struggle with adversarial techniques like content obfuscation, multilingual text, and dynamic spam generation. To overcome these challenges, SpamShield employs a fine-tuned XLM-RoBERTa transformer model, whose pre-training covers 100 languages and therefore provides cross-lingual semantic representations. The system follows a three-phase pipeline comprising multilingual dataset preparation, model fine-tuning, and deployment through a modular FastAPI backend with a web interface. It supports automatic language detection, real-time spam classification, and integration with third-party communication platforms.
The dataset was created by consolidating multiple public SMS and email spam datasets, resulting in a balanced multilingual corpus of 16,330 messages across English, Hindi, and 11 additional languages. After preprocessing and language annotation, the XLM-RoBERTa model was fine-tuned using an 80:20 train-test split (13,064 training messages and 3,266 test messages) with optimized hyperparameters. The trained model is deployed as a REST API that performs language detection, tokenization, spam prediction, and confidence estimation, returning structured responses that include the predicted class, spam probability, detected language, and processing time.
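The 80:20 split described above can be illustrated with a minimal sketch. It assumes a stratified split (the text does not state whether the split was stratified), and the function name `stratified_split`, the seed, and the synthetic corpus are hypothetical placeholders rather than the authors' code or data.

```python
import random
from collections import defaultdict


def stratified_split(records, test_fraction=0.2, seed=42):
    """Split (text, label) records into train/test while preserving label ratios.

    Hypothetical helper for illustration; stratification is an assumption.
    """
    by_label = defaultdict(list)
    for record in records:
        by_label[record[1]].append(record)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Synthetic placeholder corpus with the same size as the one described above;
# the alternating labels only mimic a balanced class distribution.
corpus = [(f"message {i}", "spam" if i % 2 else "ham") for i in range(16_330)]
train_set, test_set = stratified_split(corpus)
print(len(train_set), len(test_set))  # 13064 3266
```

With 16,330 messages, a 20% test fraction yields the 3,266-message test set reported in the evaluation.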
SpamShield adopts a three-tier architecture consisting of a web-based frontend, a FastAPI service layer, and a separate model and data layer, ensuring scalability, modularity, and efficient inference. The frontend provides an intuitive interface with real-time confidence visualization and model performance metrics, while the backend processes requests and performs inference using the fine-tuned transformer model.
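As a rough sketch of the service-layer step, the following shows how a single request could be turned into the structured response described earlier (predicted class, spam probability, detected language, processing time). All names here (`SpamPrediction`, `classify`, the 0.5 threshold, and the two toy stand-in functions) are assumptions for illustration; in the real system the scoring function would wrap the fine-tuned XLM-RoBERTa model and the endpoint would be served through FastAPI.

```python
import time
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass
class SpamPrediction:
    """Hypothetical response shape; field names are illustrative."""
    label: str
    spam_probability: float
    language: str
    processing_time_ms: float


def classify(
    text: str,
    score_spam: Callable[[str], float],
    detect_language: Callable[[str], str],
    threshold: float = 0.5,  # assumed decision threshold
) -> SpamPrediction:
    """Detect the language, score the message, and time the whole step."""
    start = time.perf_counter()
    language = detect_language(text)
    probability = score_spam(text)
    label = "spam" if probability >= threshold else "ham"
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return SpamPrediction(label, probability, language, elapsed_ms)


# Toy stand-ins so the sketch runs without a model or a language detector.
def toy_detect_language(text: str) -> str:
    return "en" if text.isascii() else "unknown"


def toy_score_spam(text: str) -> float:
    return 0.9 if "free prize" in text.lower() else 0.1


result = classify("Claim your FREE PRIZE now", toy_score_spam, toy_detect_language)
print(asdict(result))
```

Passing the scorer and language detector in as callables mirrors the separation between the service layer and the model layer: either component can be swapped without changing the request-handling code.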
Experimental evaluation shows that SpamShield achieves 98.16% accuracy, 96.01% precision, 95.27% recall, a 95.63% macro F1-score, and an AUC-ROC of 0.9911 on a held-out test set of 3,266 messages. Compared with existing spam detection methods, it offers competitive accuracy while supporting 13 languages in a single model. Inference latency remains below 100 ms on CPU and falls to 10–20 ms on GPU, making the system suitable for real-time deployment. Overall, SpamShield provides a scalable, production-ready, multilingual spam detection solution that combines high classification performance with practical deployment capabilities.
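For readers unfamiliar with how these metrics relate, the sketch below computes accuracy, precision, recall, F1, and macro F1 from a binary confusion matrix. The confusion-matrix counts are invented for illustration and are not the paper's results; the function names are likewise hypothetical.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 for the positive (spam) class."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def macro_f1(tp: int, fp: int, fn: int, tn: int) -> float:
    """Unweighted mean of the spam-class and ham-class F1 scores."""
    spam_f1 = binary_metrics(tp, fp, fn, tn)["f1"]
    # Treating ham as the positive class swaps the roles of the counts.
    ham_f1 = binary_metrics(tp=tn, fp=fn, fn=fp, tn=tp)["f1"]
    return (spam_f1 + ham_f1) / 2


# Illustrative counts only (not taken from the SpamShield evaluation).
counts = dict(tp=90, fp=10, fn=5, tn=895)
metrics = binary_metrics(**counts)
macro = macro_f1(**counts)
print({name: round(value, 4) for name, value in metrics.items()}, round(macro, 4))
```

Macro F1 averages the per-class F1 scores, so it need not equal the harmonic mean of the reported precision and recall exactly.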
Conclusion
This paper presented SpamShield, a unified multilingual spam detection system that leverages the cross-lingual transfer capabilities of the XLM-RoBERTa transformer to classify SMS and email messages across thirteen languages. The system was developed through a three-phase pipeline covering data consolidation, transformer fine-tuning, and REST API deployment. On a held-out test set of 3,266 messages, the model achieved 98.16% accuracy, a macro F1-score of 0.9563, and an AUC-ROC of 0.9911, demonstrating competitive performance relative to prior monolingual state-of-the-art systems while substantially broadening language coverage.
The modular three-tier architecture—separating the presentation layer, the FastAPI service layer, and the model persistence layer—ensures that individual components can be independently updated, scaled, or replaced without disrupting the end-to-end pipeline. The publicly accessible REST API with automatic language detection enables straightforward integration into existing email gateways, mobile SMS applications, and enterprise communication platforms.
SpamShield demonstrates that pre-trained multilingual transformers offer a highly effective and practical path to building spam detectors that meet the linguistic diversity requirements of real-world deployments, particularly in markets such as India where multiple languages coexist in digital communication channels.